Exercise 3.1

3.1. The UC Irvine Machine Learning Repository¹ contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:
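The code chunk itself did not survive this rendering; the standard way to pull the data in, as in Kuhn and Johnson, is via the mlbench package:

```r
# Load the Glass data from the mlbench package (assumed installed)
library(mlbench)
data(Glass)
str(Glass)  # produces the structure listing shown below
```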

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

RI Na Mg Al Si
Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290 Min. :69.81
1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190 1st Qu.:72.28
Median :1.518 Median :13.30 Median :3.480 Median :1.360 Median :72.79
Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445 Mean :72.65
3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630 3rd Qu.:73.09
Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500 Max. :75.41
K Ca Ba Fe Type
Min. :0.0000 Min. : 5.430 Min. :0.000 Min. :0.00000 1:70
1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000 1st Qu.:0.00000 2:76
Median :0.5550 Median : 8.600 Median :0.000 Median :0.00000 3:17
Mean :0.4971 Mean : 8.957 Mean :0.175 Mean :0.05701 5:13
3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000 3rd Qu.:0.10000 6: 9
Max. :6.2100 Max. :16.190 Max. :3.150 Max. :0.51000 7:29

There are 9 numerical predictors (RI, Na, Mg, Al, Si, K, Ca, Ba, and Fe) and one categorical target with 6 levels (Type). The data set is heavily imbalanced toward classes 1 and 2, which together account for 146 of the 214 observations. Ba and Fe have their first quartile at 0, indicating that their distributions contain a lot of zeros.

Histograms of the predictors show that all of them, except perhaps Si, are moderately to highly skewed. The histograms below are colored by the target categories and show that the distributions differ slightly across target values (especially for Al and Na), indicating that those variables may be good predictors.

Those differences in the distributions of the predictors when separated by the target variable are even more evident in the box plots below. All of the predictors show a lot of variation in their distributions for each target value. We can also see many outliers, indicated by the red dots above and below the plot ‘whiskers’.
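A per-class plot of this kind can be produced with base graphics along these lines (a sketch assuming Glass is loaded, using Al as an example; the red outlier styling via `outcol` is my assumption):

```r
# Box plot of one predictor split by glass type; outliers show as points
# beyond the whiskers, colored red here via the outcol argument
boxplot(Al ~ Type, data = Glass,
        outcol = "red", xlab = "Type", ylab = "Al")
```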

Only RI and Ca seem to be very highly correlated with each other with a correlation of about 0.81.

## [1] 7

The findCorrelation function suggests removing column 7 which corresponds to Ca.
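The cutoff used here is not shown; 0.75 is a common choice and is consistent with flagging the 0.81 RI-Ca correlation. The pair-finding half of the heuristic can be sketched in base R (high_cor_pairs is a hypothetical helper, not caret's implementation, which additionally decides which member of each pair to drop):

```r
# Hypothetical base-R helper: list predictor pairs whose absolute
# correlation exceeds a cutoff
high_cor_pairs <- function(x, cutoff = 0.75) {
  cm <- abs(cor(x))
  cm[lower.tri(cm, diag = TRUE)] <- 0  # keep each pair only once
  idx <- which(cm > cutoff, arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, 1]],
             var2 = colnames(cm)[idx[, 2]],
             cor  = cm[idx])
}
```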

There are no missing values in the dataset.

RI Na Mg Al Si K Ca Ba Fe Type
NA’s 0 0 0 0 0 0 0 0 0 0

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

As noted above, most if not all of the predictors have both outliers and skewed distributions, and a few predictors contain a large percentage of zeros.

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

skewness_statistic ratio_max_to_min
RI 1.6027151 1.015075
Na 0.4478343 1.619758
Mg -1.1364523 Inf
Al 0.8946104 12.068965
Si -0.7202392 1.080218
K 6.4600889 Inf
Ca 2.0184463 2.981584
Ba 3.3686800 Inf
Fe 1.7298107 Inf

Because there are so many zero values in the data, it’s hard to gauge the severity of the skew using the max-to-min ratio, since division by zero is undefined. But we clearly have some large skewness values and can clearly see the skew in the histograms.
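The skewness statistics above were most likely computed with e1071::skewness or similar; a minimal base-R version of the moment-based statistic (without e1071's finite-sample correction) looks like this:

```r
# Sample skewness: third central moment scaled by the 3/2 power of the
# second; positive values indicate a right tail, negative a left tail
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^(3/2)
}

# Max-to-min ratio; Inf whenever a predictor contains zeros
ratio_max_to_min <- function(x) max(x) / min(x)
```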

The skewed distributions might be improved by log, square root, or inverse transformations or a Box-Cox transformation.
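For reference, the Box-Cox family is \((x^\lambda - 1)/\lambda\) for \(\lambda \ne 0\) and \(log(x)\) for \(\lambda = 0\); a one-line sketch for a fixed \(\lambda\) (caret::preProcess with method = "BoxCox" estimates \(\lambda\) per predictor):

```r
# Box-Cox power transform for a fixed lambda; requires strictly
# positive x, which is why the zero-heavy predictors cannot use it
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}
```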

Unfortunately, because of the zero values, Mg, K, Ba, and Fe cannot be transformed using Box-Cox. I’m still not sure how to handle that, although I found this post by Rob Hyndman on the subject… Transforming data with zeros.

RI Na Mg Al Si K Ca Ba Fe
0.2838746 2.613007 4.49 0.0976177 2575.684 0.06 0.8254539 0 0.00
0.2829051 2.631169 3.60 0.3323808 2644.326 0.48 0.8145827 0 0.00
0.2824954 2.604909 3.55 0.4819347 2663.270 0.39 0.8139144 0 0.00
0.2829194 2.580974 3.69 0.2715633 2635.606 0.57 0.8195032 0 0.00
0.2828507 2.585506 3.62 0.2271057 2669.843 0.55 0.8176698 0 0.00
0.2824323 2.548664 3.61 0.5455844 2661.810 0.64 0.8176698 0 0.26

I’m not seeing a lot of improvement in the histograms after transformation with Box-Cox, so let’s try something else… Another post suggests three ways of handling the zeros:

  1. Add a constant value (c) to each value of the variable, then take a log transformation
  2. Impute zero values with the mean
  3. Take the square root instead of the log for the transformation²

Let’s try the square root transformation first.
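This is a one-liner over the nine numeric columns (a sketch assuming Glass is loaded; column 10 is the Type factor):

```r
# Square roots are defined at zero, so this works for every predictor
glass_sqrt <- as.data.frame(lapply(Glass[, -10], sqrt))
```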

The square root transformation seems to have done a little better at taming our distributions; however, some of them are still highly skewed.

Let’s try one more of the suggestions by using \(log(x+1)\) to transform our data.

Maybe a little better, but still not great!

freqRatio percentUnique zeroVar nzv
RI 1.000000 83.177570 FALSE FALSE
Na 1.000000 66.355140 FALSE FALSE
Mg 5.250000 43.925234 FALSE FALSE
Al 1.333333 55.140187 FALSE FALSE
Si 1.000000 62.149533 FALSE FALSE
K 2.500000 30.373832 FALSE FALSE
Ca 1.000000 66.822430 FALSE FALSE
Ba 88.000000 15.887851 FALSE FALSE
Fe 20.571429 14.953271 FALSE FALSE
Type 1.085714 2.803738 FALSE FALSE

The caret function nearZeroVar does not indicate that the variables with high frequencies of zeros should be removed, so another solution might be to try the alternative Box-Cox transformation that Hyndman suggested in his blog post.³
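One such alternative is the two-parameter Box-Cox transform \(((x + \lambda_2)^{\lambda_1} - 1)/\lambda_1\), which shifts the data away from zero before applying the power. With \(\lambda_2 = 1\) and \(\lambda_1 = 0\) it reduces to \(log(x + 1)\), which appears to match the values in the table below; these \(\lambda\) values are my assumption, not taken from the post. A sketch:

```r
# Two-parameter Box-Cox: shift by lambda2, then apply the usual power
# transform; at lambda1 = 0 this becomes log(x + lambda2)
box_cox2 <- function(x, lambda1, lambda2 = 1) {
  if (abs(lambda1) < 1e-8) log(x + lambda2)
  else ((x + lambda2)^lambda1 - 1) / lambda1
}
```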

RI Na Mg Al Si K Ca Ba Fe
0.9246596 2.683758 1.702928 0.7419373 4.287441 0.0582689 2.277267 0 0.0000000
0.9233100 2.700690 1.526056 0.8586616 4.300410 0.3920421 2.178155 0 0.0000000
0.9227419 2.676216 1.515127 0.9321641 4.303930 0.3293037 2.172476 0 0.0000000
0.9233299 2.653946 1.545433 0.8285518 4.298781 0.4510756 2.221375 0 0.0000000
0.9232346 2.658159 1.530395 0.8064759 4.305146 0.4382549 2.204972 0 0.0000000
0.9226544 2.623944 1.528228 0.9631743 4.303660 0.4946962 2.204972 0 0.2311117

I think that in this case it might be necessary to know whether these are real zeros or ‘censored’ data, where the zeros may actually be trace amounts below the detection limit. For the censored case Kuhn and Johnson state:

when a sample has a value below the limit of detection, the actual limit can be used in place of the real value. For this situation, it is also common to use a random number between zero and the limit of detection.⁴

Exercise 3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:
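As with the Glass data, the loading chunk is missing from this rendering; the standard call is:

```r
# Load the Soybean data from the mlbench package (assumed installed)
library(mlbench)
data(Soybean)
```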

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
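The chapter's two rules of thumb for a degenerate (near-zero-variance) distribution are a large ratio of the most common to the second most common value and a small percentage of unique values. Both are easy to sketch in base R (freq_ratio and pct_unique are hypothetical helpers mirroring what caret::nearZeroVar reports):

```r
# Ratio of the most frequent value's count to the second most frequent;
# large values suggest a degenerate distribution
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) Inf else as.numeric(counts[1] / counts[2])
}

# Percentage of distinct non-missing values relative to sample size
pct_unique <- function(x) 100 * length(unique(x[!is.na(x)])) / length(x)
```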

Class date
brown-spot : 92 5 :149
alternarialeaf-spot: 91 4 :131
frog-eye-leaf-spot : 91 3 :118
phytophthora-rot : 88 2 : 93
anthracnose : 44 6 : 90
brown-stem-rot : 44 (Other):101
(Other) :233 NA’s : 1
crop.hist area.dam stem.cankers canker.lesion fruit.pods fruit.spots
0 : 65 0 :123 0 :379 0 :320 0 :407 0 :345
1 :165 1 :227 1 : 39 1 : 83 1 :130 1 : 75
2 :219 2 :145 2 : 36 2 :177 2 : 14 2 : 57
3 :218 3 :187 3 :191 3 : 65 3 : 48 4 :100
NA’s: 16 NA’s: 1 NA’s: 38 NA’s: 38 NA’s: 84 NA’s:106
precip temp sever seed.tmt germ leaf.halo
0 : 74 0 : 80 0 :195 0 :305 0 :165 0 :221
1 :112 1 :374 1 :322 1 :222 1 :213 1 : 36
2 :459 2 :199 2 : 45 2 : 35 2 :193 2 :342
NA’s: 38 NA’s: 30 NA’s:121 NA’s:121 NA’s:112 NA’s: 84
leaf.marg leaf.size leaf.mild ext.decay int.discolor roots
0 :357 0 : 51 0 :535 0 :497 0 :581 0 :551
1 : 21 1 :327 1 : 20 1 :135 1 : 44 1 : 86
2 :221 2 :221 2 : 20 2 : 13 2 : 20 2 : 15
NA’s: 84 NA’s: 84 NA’s:108 NA’s: 38 NA’s: 38 NA’s: 31
plant.stand hail plant.growth leaves leaf.shread leaf.malf stem lodging
0 :354 0 :435 0 :441 0: 77 0 :487 0 :554 0 :296 0 :520
1 :293 1 :127 1 :226 1:606 1 : 96 1 : 45 1 :371 1 : 42
NA’s: 36 NA’s:121 NA’s: 16 NA NA’s:100 NA’s: 84 NA’s: 16 NA’s:121
fruiting.bodies mycelium sclerotia seed mold.growth seed.discolor seed.size shriveling
0 :473 0 :639 0 :625 0 :476 0 :524 0 :513 0 :532 0 :539
1 :104 1 : 6 1 : 20 1 :115 1 : 67 1 : 64 1 : 59 1 : 38
NA’s:106 NA’s: 38 NA’s: 38 NA’s: 92 NA’s: 92 NA’s:106 NA’s: 92 NA’s:106

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
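One possible strategy (a sketch, not a definitive answer): first measure how much is missing per predictor, drop any predictor whose missingness is too high to be worth rescuing, and impute the rest. Since the Soybean predictors are mostly categorical, per-column mode imputation is a simple baseline (impute_mode is a hypothetical helper; if the missingness is informative about the class, treating ‘missing’ as its own factor level may be preferable):

```r
# Fraction of missing values per column, sorted worst-first
na_frac <- function(df) sort(sapply(df, function(x) mean(is.na(x))),
                             decreasing = TRUE)

# Replace NAs in a factor with its most common level
impute_mode <- function(x) {
  mode_val <- names(which.max(table(x)))
  x[is.na(x)] <- mode_val
  x
}
```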

Footnotes